Hello and welcome to my first GitHub site! My name is Mia and I’m a college student in Saint Paul, Minnesota. I’ve been learning R for the past few months in my “Intro to Data Science” class, but I am still very much a beginner. Throughout this process, I have relied heavily on internet tutorials like the ones on r-bloggers.com, so I decided to make my own. I hope this tutorial can be helpful for another beginner. Coding is intimidating, but if I can do it, so can you!

In this tutorial, we’ll be exploring Spotify data – how to access it using the Spotify API and the spotifyr wrapper package, as well as what all those variables actually mean. I’ve found that Spotify is a great source of data because the information is so widely available and it has some really unique indices for quantifying music. We’ll look at three of these indices: key, speechiness, and danceability. I’ll describe what each of them means musically, then show different ways to represent them dynamically using ggplot2 and plotly.

library(dplyr)
library(spotifyr)
library(ggplot2)  # loaded explicitly, since ggplot() and the theme functions come from here
library(plotly)
  1. The first step in accessing Spotify data is to get an API key. To do so, log in to your dashboard on the “Spotify for Developers” page.

  2. Select “Create a Client ID” and fill out the required questions. This will take you to a page that shows your Client ID and Client Secret.

  3. Add the following code to your R Markdown file:

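# Replace the placeholders with the values from your dashboard
id <- 'your client ID'
secret <- 'your client secret'
# spotifyr reads the credentials from these environment variables
Sys.setenv(SPOTIFY_CLIENT_ID = id)
Sys.setenv(SPOTIFY_CLIENT_SECRET = secret)
access_token <- get_spotify_access_token()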
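If you plan to publish your R Markdown file, typing the secret straight into it can leak it. Here is a minimal sketch of one alternative, assuming you have already stored the two values under the same variable names (for example in your `.Renviron` file). `check_spotify_credentials` is a hypothetical helper name of my own, not part of spotifyr:

```r
# Hypothetical helper: stop early with a clear message if either
# environment variable is missing, instead of failing inside the API call.
check_spotify_credentials <- function() {
  vars <- c("SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET")
  values <- Sys.getenv(vars, unset = "")
  missing <- vars[values == ""]
  if (length(missing) > 0) {
    stop("Missing environment variable(s): ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Usage (assumes the variables were set beforehand, e.g. in ~/.Renviron):
# check_spotify_credentials()
# access_token <- spotifyr::get_spotify_access_token()
```

The check itself only uses base R, so it runs even before spotifyr is installed.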
  1. Now that you have your Spotify access token, you can begin getting data using spotifyr. In this example, I wanted to compare the Top 50 playlists from four different countries (Taiwan, France, Bolivia, and the U.S.). To do so, I manually added the songs from the four Top 50 playlists to new playlists in my own account. This is a bit tedious, but hey – we’re beginners here! And it works!

  2. Use the get_user_playlists, get_playlist_tracks, and get_track_audio_features functions and your own Spotify id to retrieve data about all the songs on the playlists.

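my_id <- 'your spotify id'
# List every playlist on your account
my_plists <- get_user_playlists(my_id)

# Keep only the four playlists created for this tutorial
my_plists2 <- my_plists %>%
  filter(playlist_name %in% c('Taiwan Top 50', 'France Top 50', 'Bolivia Top 50', 'U.S. Top 50'))

# Get the tracks on those playlists, then the audio features for each track
tracks <- get_playlist_tracks(my_plists2)
features <- get_track_audio_features(tracks)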
  3. Do a left_join to join the two tables (playlist tracks and track features) by the “track_uri” column.
tracks2 <- tracks %>%
  left_join(features, by = "track_uri")
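To see what the join does without calling the API, here is a small sketch on made-up data. The two toy tables and their values are invented for illustration; only the `track_uri` key and the `left_join` call mirror the step above.

```r
library(dplyr)

# Made-up stand-ins for the two tables; the real ones have many more columns.
toy_tracks <- data.frame(
  track_uri  = c("uri_a", "uri_b", "uri_c"),
  track_name = c("Song A", "Song B", "Song C")
)
toy_features <- data.frame(
  track_uri   = c("uri_a", "uri_b"),
  speechiness = c(0.05, 0.40)
)

# left_join keeps every row of toy_tracks; uri_c has no match in
# toy_features, so its speechiness is NA rather than the row being dropped.
toy_joined <- toy_tracks %>%
  left_join(toy_features, by = "track_uri")

toy_joined
```

A left join is a safe choice here because no track disappears from the playlist table if its features happen to be missing.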

Check out all those cool variables! We’re going to explore three of them today, starting with speechiness.

The Spotify “Get Audio Features” page says that speechiness measures the presence of spoken words in a track. Tracks with a speechiness score between 0.33 and 0.66 contain both music and speech; rap songs are one example. Scores above 0.66 describe tracks that are probably made up entirely of spoken words. With that in mind, we’re going to look at the difference between each track’s speechiness score and 0.33. If the difference is above 0, the track is most likely a rap song. The farther below 0 it falls, the less spoken-word content the track has.
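As a rough sketch of that reasoning, the snippet below sorts a few made-up scores into bands and computes the difference from 0.33. The 0.66 upper cutoff comes from Spotify's documentation, which describes values above it as probably all spoken word. The band labels and the `classify_speechiness` name are my own shorthand, not Spotify's.

```r
# Hypothetical helper: label each score by the band it falls in.
# Intervals are [0, 0.33), [0.33, 0.66), and [0.66, 1].
classify_speechiness <- function(speechiness) {
  cut(
    speechiness,
    breaks = c(0, 0.33, 0.66, 1),
    labels = c("mostly music", "music and speech", "mostly spoken word"),
    include.lowest = TRUE,
    right = FALSE
  )
}

example_scores <- c(0.04, 0.33, 0.50, 0.90)  # made-up values

data.frame(
  speechiness = example_scores,
  difference  = example_scores - 0.33,
  band        = classify_speechiness(example_scores)
)
```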

  1. Use mutate to create a new column that calculates a speechiness difference score by subtracting 0.33 from the speechiness.
tracks2 <- tracks2 %>%
  mutate(difference = speechiness - 0.33)
  2. For the sake of ease and aesthetics, I specified my colors. This step is optional, but it happens to be my favorite part of making visualizations.
green <- "#1ed760"
yellow <- "#e7e247"
darkgray <- "#254441"
pink <- "#ff6f59"
blue <- "#17bebb"
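One optional variation, sketched here as an assumption rather than what I did: put the colors in a named vector so each playlist is tied to a color by name instead of by position. The pairing below is one I picked for illustration, and `playlist_colors` is a hypothetical name.

```r
# Same hex values as above, repeated so this snippet runs on its own.
green  <- "#1ed760"
yellow <- "#e7e247"
pink   <- "#ff6f59"
blue   <- "#17bebb"

# Assumed pairing of colors to playlists; swap them however you like.
playlist_colors <- c(
  "Bolivia Top 50" = green,
  "France Top 50"  = yellow,
  "Taiwan Top 50"  = pink,
  "U.S. Top 50"    = blue
)

# Usage inside a plot: scale_fill_manual(values = playlist_colors)
```

With a named vector, a playlist keeps its color even if another playlist is added or removed later.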
  3. I used ggplot2 to make a geom_col of the speechiness difference scores and faceted them by country to make it easier to compare the four. Since the main point of the graph is not necessarily to show the numerical speechiness difference score, but rather how far each bar goes above or below zero, I took out the grid lines. I think this also makes it look sleeker. I used ggplotly to make the graph interactive so users can zoom in and see the track, artist, and speechiness each bar represents.
viz1 <- ggplot(tracks2, aes(x = reorder(track_name, -difference), y = difference,
                            fill = playlist_name,
                            # text is only used by ggplotly for the hover tooltip
                            text = paste("Track:", track_name, "<br>",
                                         "Artist:", artist_name, "<br>",
                                         "Speechiness:", speechiness))) +
  geom_col() +
  scale_fill_manual(values = c(green, yellow, pink, blue)) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        # remove both major and minor grid lines
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none") +
  ylab("Speechiness Difference") +
  facet_wrap(~ playlist_name)

ggplotly(viz1, tooltip = c("text"))

France has more bars above zero than any of the other countries, which suggests that there are more rap songs on the France Top 50 than on the other playlists. Taiwan is the only country with no bars above zero, and its bars are concentrated much farther below zero, so its Top 50 likely has more instrumental tracks or tracks with very little spoken-word presence.
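If you would rather check a reading like this with numbers than by eye, one possible sketch is to count the tracks above zero in each playlist. The rows below are made up purely to show the shape of the calculation; with real data you would start from `tracks2` instead of `toy_tracks2`.

```r
library(dplyr)

# Made-up rows standing in for tracks2; real values come from Spotify.
toy_tracks2 <- data.frame(
  playlist_name = c("France Top 50", "France Top 50",
                    "Taiwan Top 50", "Taiwan Top 50"),
  difference    = c(0.05, -0.20, -0.29, -0.30)
)

# For each playlist: how many tracks sit above zero, and the typical difference.
toy_summary <- toy_tracks2 %>%
  group_by(playlist_name) %>%
  summarise(
    tracks_above_zero = sum(difference > 0),
    median_difference = median(difference)
  )

toy_summary
```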